Analysis of Senator Stock Trading

Patrick Cao

Introduction

Insider trading, or trading financial assets using information that is not open to the public, is punishable by law in the United States. It is a serious violation -- a maximum penalty of 20 years in prison, and up to $20 million in fines. The Securities and Exchange Commission (SEC) is responsible for identifying and conducting investigations regarding insider trading.

With that being said, U.S. senators (and other congresspeople) often have access to information about upcoming policies or laws before the general public, yet are still allowed to trade securities. Senators, however, must disclose their trades publicly. More recently, a few senators have been investigated by the SEC for insider trading using nonpublic knowledge given to them about the COVID-19 pandemic. With the rise in popularity of investing, senator stock trading patterns has become to the public's attention. You may have seen posts on the internet indicating the timing of senators trades -- right before major crashes or rises -- is very good, implying that there is rampant insider trading going on in the senate. For example, take this post on Reddit that was upvoted over 30,000 times: https://www.reddit.com/r/dataisbeautiful/comments/gjlvnd/.

Is insider trading in the senate as rampant as this post suggests? This tutorial aims to identify potential insider trading patterns by the U.S. senate. In order to answer this question, we will be using data from https://senatestockwatcher.com/, which itself compiles data from https://efdsearch.senate.gov/. efdsearch.senate.gov does not have an official API, and scraping each page would be extremely tedious. Senate stock watcher aggregates financial disclosures from all senators into one nice dataset.

Let's first start by downloading the dataset. While Senate Stock Watcher provides an API for up to date datasets, we can just download a csv file as well. Download the "All Transactions" csv file from https://senatestockwatcher.com/api.html.

Let's load that into a pandas dataframe.

In [17]:
import pandas as pd
import numpy as np
import datetime

df = pd.read_csv('all_transactions.csv')
df.head()
Out[17]:
transaction_date owner ticker asset_description asset_type type amount comment senator ptr_link
0 11/16/2020 Spouse BA The Boeing Company Stock Purchase $15,001 - $50,000 R Pat Roberts https://efdsearch.senate.gov/search/view/ptr/b...
1 11/16/2020 Spouse V Visa Inc. Stock Purchase $15,001 - $50,000 555 Pat Roberts https://efdsearch.senate.gov/search/view/ptr/b...
2 11/24/2020 Spouse COST Costco Wholesale Corporation Stock Purchase $15,001 - $50,000 555 Pat Roberts https://efdsearch.senate.gov/search/view/ptr/b...
3 11/04/2020 Spouse COST Costco Wholesale Corporation Stock Sale (Full) $15,001 - $50,000 555 Pat Roberts https://efdsearch.senate.gov/search/view/ptr/b...
4 11/09/2020 Spouse BA The Boeing Company Stock Sale (Full) $15,001 - $50,000 sep Pat Roberts https://efdsearch.senate.gov/search/view/ptr/b...

Let's clean this up a little by limiting and purchases and sales only, and dropping unnecessary columns.

In [18]:
start_date = '2016-01-01'
end_date = '2020-11-01'

# Turn transaction_date from string into datetime
df['transaction_date'] = pd.to_datetime(df['transaction_date'])

# Remove (Full) and (Partial) after Sale type
df['type'] = df['type'].apply(lambda x: str(x).split()[0])

# Limit to only purchases and sales
df = df[(df['type'] == 'Purchase') | (df['type'] == 'Sale')]

# limit the dataset to a date range
df = df[(df['transaction_date'] > start_date) & (df['transaction_date'] < end_date)]

df = df.drop(['asset_description', 'comment', 'ptr_link'], axis=1)
df.head()
Out[18]:
transaction_date owner ticker asset_type type amount senator
76 2020-10-13 Joint -- Corporate Bond Sale $1,001 - $15,000 Sheldon Whitehouse
78 2019-04-05 Spouse NVS Stock Sale $1,001 - $15,000 William Cassidy
79 2020-10-16 Child NVDA Stock Purchase $1,001 - $15,000 Ron L Wyden
80 2020-10-16 Child WDAY Stock Purchase $1,001 - $15,000 Ron L Wyden
81 2020-10-16 Child NVDA Stock Purchase $1,001 - $15,000 Ron L Wyden

We got our senator stock trading data, but now we need actual historical market data to visualize these trades. Let's use yfinance, a python Yahoo! Finance library, to get historical market data on SPY, an ETF tracking the S&P500.

In [19]:
import yfinance as yf

spy = yf.Ticker('SPY')
spy_hist = spy.history(period='max')
spy_hist.tail()
Out[19]:
Open High Low Close Volume Dividends Stock Splits
Date
2020-12-07 369.019989 369.619995 367.720001 369.089996 48944300 0.0 0
2020-12-08 367.720001 370.779999 367.670013 370.170013 42458900 0.0 0
2020-12-09 370.880005 371.049988 365.950012 366.850006 74098300 0.0 0
2020-12-10 365.369995 367.859985 364.429993 366.730011 57617300 0.0 0
2020-12-11 364.899994 366.740204 363.260010 366.140015 57698614 0.0 0

Neat! Now we can start putting these two things together.

Exploratory Data Analysis

Senator Data

Let's take an initial look at the data. It would be interesting to see the most traded stocks, the most active senators, or the average value of each trade.

In [20]:
import matplotlib.pyplot as plt

# number of trades by senators
num_trades = df.groupby('senator').count().sort_values(by='transaction_date', ascending=False).head(10)
plt.title('Top 10 Most Active Senators')
plt.xlabel('Total Number of Trades')
plt.ylabel('Senator')
plt.barh(num_trades.index, num_trades['transaction_date'])
plt.show()

# number of trades by ticker
trades_by_ticker = df[df['ticker'] != '--'].groupby('ticker').count().sort_values(by='transaction_date', ascending=False).head(20)
plt.title('Top 20 Most Traded Stocks')
plt.xlabel('Total Number of Trades')
plt.ylabel('Ticker')
plt.barh(trades_by_ticker.index, trades_by_ticker['transaction_date'])
plt.show()

# values of trades
trade_val = df.groupby('amount').count().sort_values(by='transaction_date', ascending=False)
plt.title('Value of Trades')
plt.xlabel('Number of Trades')
plt.ylabel('Value of Trade ($)')
plt.barh(trade_val.index, trade_val['transaction_date'])
plt.show()

It looks like most trades by senators are small -- under $15,000 per trade -- and they like trading AAPL, which is coincidentally the largest market cap company in the world. There's a good mix of tech, pharma, communications, and retail stocks that are the most commonly traded in the senate. There's not too much insight we can pull from this, but it is interesting.

Let's now try to recreate the plot on the reddit post. We need to plot overall senator trading activity against the S&P500, or SPY.

It's a little tough to track performance exactly. The size of each trade isn't an exact number -- it's a range. Let's check out what kinds of values we're working with here.

In [21]:
df['amount'].unique()
Out[21]:
array(['$1,001 - $15,000', '$15,001 - $50,000', '$1,000,001 - $5,000,000',
       '$250,001 - $500,000', '$500,001 - $1,000,000',
       '$50,001 - $100,000', '$100,001 - $250,000',
       '$5,000,001 - $25,000,000'], dtype=object)

Yikes. The data doesn't have the exact value of securities traded, and the granularity of each bucket is pretty large. The best thing we can probably do is estimate the size of each trade by just taking the middle value of each trade, and just count the number of buys and sells each day. Thus, we'll have 2 metrics to try and predict future SP500 prices: number of buys and sells, and value of all trades.

In [22]:
# Returns the average of the given bound
def getBounds(row):
    
    amount = row[0]
    trade_type = row[1]

    # This amount shows up once in the entire dataset. Let's just assume the trade is $50m for now. It won't affect our analysis too much.
    if amount == 'Over $50,000,000':
        return 5e7
    
    split = amount.split(' - ')
    lower = int(split[0][1:].replace(',', ''))
    upper = int(split[1][1:].replace(',', ''))
    mid = lower + ((upper - lower) / 2)

    if trade_type == 'Purchase':
        return mid
    return -mid

df['est_amt'] = pd.Series(df[['amount', 'type']].apply(getBounds, axis=1))

df
Out[22]:
transaction_date owner ticker asset_type type amount senator est_amt
76 2020-10-13 Joint -- Corporate Bond Sale $1,001 - $15,000 Sheldon Whitehouse -8000.5
78 2019-04-05 Spouse NVS Stock Sale $1,001 - $15,000 William Cassidy -8000.5
79 2020-10-16 Child NVDA Stock Purchase $1,001 - $15,000 Ron L Wyden 8000.5
80 2020-10-16 Child WDAY Stock Purchase $1,001 - $15,000 Ron L Wyden 8000.5
81 2020-10-16 Child NVDA Stock Purchase $1,001 - $15,000 Ron L Wyden 8000.5
... ... ... ... ... ... ... ... ...
7468 2016-01-06 Self UA Stock Purchase $50,001 - $100,000 John Hoeven 75000.5
7470 2016-11-23 Joint -- Stock Sale $1,001 - $15,000 Patrick J Toomey -8000.5
7471 2016-01-04 Joint -- Stock Sale $1,001 - $15,000 Patrick J Toomey -8000.5
7472 2016-01-05 Joint -- Stock Sale $15,001 - $50,000 Patrick J Toomey -32500.5
7473 2016-01-05 Joint -- Stock Purchase $1,001 - $15,000 Patrick J Toomey 8000.5

6539 rows × 8 columns

Alright, now we got some rough estimates of the value of each transaction, we can group trades together by date and find the aggregate for each day.

In [23]:
aggregate_action_with_type = df.groupby([pd.Grouper(key='transaction_date', freq='D'), 'asset_type', 'type']).agg(est_amt=('est_amt', 'sum'), num_trades=('est_amt', 'count')).reset_index()

# because a sale is the opposite of a purchase, we want to show that
aggregate_action_by_asset = aggregate_action_with_type.copy()
aggregate_action_by_asset['num_purchases'] = aggregate_action_by_asset.apply(lambda row: -row['num_trades'] if row['type'] == 'Sale' else row['num_trades'], axis=1)

aggregate_action_by_asset = aggregate_action_by_asset.groupby(['transaction_date', 'asset_type']).sum().reset_index()
aggregate_action = aggregate_action_by_asset.groupby('transaction_date').sum().reset_index()
print(aggregate_action_by_asset.head())
print(aggregate_action.head())
  transaction_date          asset_type   est_amt  num_trades  num_purchases
0       2016-01-04               Stock   -8000.5           1             -1
1       2016-01-05  Municipal Security   32500.5           1              1
2       2016-01-05               Stock   47006.0          22             12
3       2016-01-06               Stock  139004.5          13              9
4       2016-01-07  Municipal Security   65001.0           2              2
  transaction_date   est_amt  num_trades  num_purchases
0       2016-01-04   -8000.5           1             -1
1       2016-01-05   79506.5          23             13
2       2016-01-06  139004.5          13              9
3       2016-01-07  129005.0          10             10
4       2016-01-08  -32500.5           7             -1

We are now ready to plot senator trades against the market.

In [33]:
spy_hist['Date'] = spy_hist.index

# since we've limited the senator trades to a date range, we want to also limit SPY history
# to the same date range
market = spy_hist[(spy_hist['Date'] > start_date) & (spy_hist['Date'] < end_date)]

def plot_trades_vs_spy(trades, asset_type, metric, y_label):

    # We have a lot of data, so we should make the figure very wide
    plt.figure(figsize=(18,6))

    plt.plot(market['Date'], market['Close'])

    plt.xlabel('Year')
    plt.ylabel('SPY price ($)')
    ax2 = plt.twinx()

    # scale bar graph limits s.t. 0 is centered in graph
    limit = max(abs(trades[metric].min()), trades[metric].max()) * 1.1
    ax2.set_ylim(-limit, limit)
    ax2.spines['bottom'].set_position(('data', 0))

    plt.ylabel(y_label)

    plt.bar(trades['transaction_date'], trades[metric], width=5, color='orange')
    plt.title('Senator Trades of ' + str(asset_type) +  ' (' + metric + ')' + ' vs. SPY price')

    plt.show()

for asset_type in aggregate_action_by_asset['asset_type'].unique():
    trades_by_asset = aggregate_action_by_asset[aggregate_action_by_asset['asset_type'] == asset_type]
    plot_trades_vs_spy(trades_by_asset, asset_type, 'num_purchases', 'Number of Purchases/Day')
    plot_trades_vs_spy(trades_by_asset, asset_type, 'est_amt', 'Net Securities Traded/Day ($)')

plot_trades_vs_spy(aggregate_action, 'Overall', 'num_purchases', 'Number of Purchases/Day')
plot_trades_vs_spy(aggregate_action, 'Overall', 'est_amt', 'Net Securities Traded/Day ($)')

Is there a relationship between these trades and future SPY price? It would be helpful to try and quantify this and instead plot this data on a scatterplot. Let's plot these trades vs. future SPY price as well. To find the price of SPY a number of days in the future, we need to know all the open market days. Luckily, there is a library for that.

But how long in the future should we check? It's probably not feasible to check a lot of different number of days in the future, so we have to pick this value somewhat arbitrarily. Let's just pick 1 week and 3 weeks. There are 5 trading days per week, so when looking at future trading days, we could look at SPY prices 5 and 15 days in the future.

In [36]:
import pandas_market_calendars as mcal
from datetime import datetime

nyse = mcal.get_calendar('NYSE')
market_open_days = nyse.valid_days(start_date=start_date, end_date=datetime.today().strftime('%Y-%m-%d'))

# returns the spy price num_days from today
def get_future_spy_price(num_days, today):
    # calculate current spy data
    today_index = market_open_days.get_loc(today, method='nearest')
    today_close = spy_hist.iloc[spy_hist.index.get_loc(today, method='nearest')]['Close']

    # calculate future date and get spy data
    future_index = today_index + num_days
    future_date = market_open_days[future_index]
    future_close = spy_hist.iloc[spy_hist.index.get_loc(future_date, method='nearest')]['Close']
    return 100 * (future_close - today_close) / today_close


aggregate_action_by_asset['5_day_spy_pct_change'] = aggregate_action_by_asset['transaction_date'].apply(lambda date: get_future_spy_price(5, date))
aggregate_action_by_asset['15_day_spy_pct_change'] = aggregate_action_by_asset['transaction_date'].apply(lambda date: get_future_spy_price(15, date))
aggregate_action['5_day_spy_pct_change'] = aggregate_action['transaction_date'].apply(lambda date: get_future_spy_price(5, date))
aggregate_action['15_day_spy_pct_change'] = aggregate_action['transaction_date'].apply(lambda date: get_future_spy_price(15, date))

aggregate_action
Out[36]:
transaction_date est_amt num_trades num_purchases 5_day_spy_pct_change 15_day_spy_pct_change
0 2016-01-04 -8000.5 1 -1 -4.432377 -5.382561
1 2016-01-05 79506.5 23 13 -3.823988 -6.570312
2 2016-01-06 139004.5 13 9 -5.024638 -4.883805
3 2016-01-07 129005.0 10 10 -1.092534 -0.170072
4 2016-01-08 -32500.5 7 -1 -2.141516 0.901415
... ... ... ... ... ... ...
1053 2020-10-08 32500.5 1 1 1.082088 -4.014192
1054 2020-10-13 -24001.5 3 -3 -1.927855 -4.027077
1055 2020-10-16 48003.0 6 6 -0.434798 0.826397
1056 2020-10-27 -3000000.5 1 -1 -0.647508 6.622906
1057 2020-10-30 8000.5 3 1 7.233415 8.816677

1058 rows × 6 columns

Now that we have our data, we can begin plotting. We want to plot value of trades vs. future value and number of purchases vs. future value for both 5 and 15 days in the future.

In [34]:
def set_subplot(ax, x, y, title, x_label, y_label):
    ax.scatter(x, y)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.grid()

def plot_trades_vs_future_spy_price(trades, asset_type):
    fig, axs = plt.subplots(1, 4, figsize=(20, 4))
    fig.suptitle('Trades vs. Future SPY Price % Change for ' + asset_type)

    set_subplot(axs[0], trades['est_amt'], trades['5_day_spy_pct_change'], '5 day', 'Net Value of Assets Bought/Sold', '5 Day % Change in S&P500')    
    set_subplot(axs[1], trades['num_purchases'], trades['5_day_spy_pct_change'], '5 day', 'Number of Purchases (Sales count as negative)', '5 Day % Change in S&P500')

    set_subplot(axs[2], trades['est_amt'], trades['15_day_spy_pct_change'], '15 day', 'Net Value of Assets Bought/Sold', '15 Day % Change in S&P500')    
    set_subplot(axs[3], trades['num_purchases'], trades['15_day_spy_pct_change'], '15 day', 'Number of Purchases (Sales count as negative)', '15 Day % Change in S&P500')
    plt.show()

for asset_type in aggregate_action_by_asset['asset_type'].unique():
    trades_by_asset = aggregate_action_by_asset[aggregate_action_by_asset['asset_type'] == asset_type]
    plot_trades_vs_future_spy_price(trades_by_asset, asset_type)

plot_trades_vs_future_spy_price(aggregate_action, 'Overall')

There aren't any obvious trends in most of the categories. However, we can still try to uncover trends by building linear regression models and performing hypothesis tests with this data. More on that in the machine learning and hypothesis testing section.

Evaluating Every Single Trade, and Grouping by Senator

The SP500 represents the aggregate performance of 500 different companies in the index. Although the stock market largely follows the same trends, individual stocks may react differently to world events. Thus, more insight may be gained if we analyze the performance of individual stock picks by senators. For example, David Purdue bought stock in PPE manufacturers just before COVID-19 panic started to spread in the U.S.

Additionally, insider trading is, in fact, a pretty big deal, and so I would imagine that not many senators would partake in it. Let's now visualize the timing performance of each senator separately -- senators who insider trade probably do it multiple times.

In [29]:
hists = {}

# returns a function that gets the spy price num_days from today
def get_future_stock_price(ticker, today, num_days):
    try:
        stock_history = yf.Ticker(ticker).history(period='5y')

        # calculate current stock data
        today_index = market_open_days.get_loc(today, method='nearest')
        today_close = stock_history.iloc[stock_history.index.get_loc(today, method='nearest')]['Close']

        # calculate future date and get stock data
        future_index = today_index + num_days
        future_date = market_open_days[future_index]
        future_close = stock_history.iloc[stock_history.index.get_loc(future_date, method='nearest')]['Close']
        return (future_close - today_close) / today_close

    # yfinance sometimes cannot find certain tickers -- just return nan on error
    except Exception as e:
        return np.nan
In [39]:
stock_trades = df[df['ticker'] != '--']

# we're processing an insane amount of data here, so let's save this dataframe back to the disk
# so we can read it back easily
try:
    stock_trades = pd.read_csv('stock_trading_performance.csv')
except FileNotFoundError:
    print('calculating...')
    stock_trades['5_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 5), axis=1)
    print('done with 5 day')
    stock_trades['15_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 15), axis=1)
    print('done with 15 day')
    stock_trades['30_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 30), axis=1)
    print('done with 15 day')
    # stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 5), axis=1)
    stock_trades.to_csv('stock_trading_performance.csv')

stock_trades.head()
Out[39]:
Unnamed: 0 transaction_date owner ticker asset_type type amount senator lower_amt mid_amt upper_amt 5_day_pct_change 15_day_pct_change 30_day_pct_change est_amt
0 78 2019-04-05 Spouse NVS Stock Sale $1,001 - $15,000 William Cassidy -1001 -8000.5 -15000 -0.047262 -0.036695 -0.026246 -8000.5
1 79 2020-10-16 Child NVDA Stock Purchase $1,001 - $15,000 Ron L Wyden 1001 8000.5 15000 -0.016019 0.054339 -0.029685 8000.5
2 80 2020-10-16 Child WDAY Stock Purchase $1,001 - $15,000 Ron L Wyden 1001 8000.5 15000 -0.023142 -0.019394 -0.020309 8000.5
3 81 2020-10-16 Child NVDA Stock Purchase $1,001 - $15,000 Ron L Wyden 1001 8000.5 15000 -0.016019 0.054339 -0.029685 8000.5
4 82 2020-10-16 Child WDAY Stock Purchase $1,001 - $15,000 Ron L Wyden 1001 8000.5 15000 -0.023142 -0.019394 -0.020309 8000.5
In [45]:
def plot_future_stock_performance(trades, senator):
    fig, axs = plt.subplots(1, 3, figsize=(20, 5))
    fig.suptitle('Trade Value vs. Future Stock Price % Change for Senator ' + senator)

    axs[0].scatter(trades['est_amt'], trades['5_day_pct_change'])
    axs[0].set_title('5 Day')
    axs[0].set_xlabel('Value of Assets Bought/Sold')
    axs[0].set_ylabel('5 Day % Change in Stock Price')
    axs[0].grid()

    axs[1].scatter(trades['est_amt'], trades['15_day_pct_change'])
    axs[1].set_title('15 Day')
    axs[1].set_xlabel('Value of Assets Bought/Sold')
    axs[1].set_ylabel('15 Day % Change in Stock Price')
    axs[1].grid()

    axs[2].scatter(trades['est_amt'], trades['30_day_pct_change'])
    axs[2].set_title('30 Day')
    axs[2].set_xlabel('Value of Assets Bought/Sold')
    axs[2].set_ylabel('30 Day % Change in Stock Price')
    axs[2].grid()

    plt.show()

plot_future_stock_performance(stock_trades, 'Overall')

for senator in stock_trades['senator'].unique():
    stock_trades_by_senator = stock_trades[stock_trades['senator'] == senator]
    plot_future_stock_performance(stock_trades_by_senator, senator)
In [ ]: